========================================================

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Quality ranges from 3 to 8. Grades 5 and 6 occur far more often than the others (681 and 638 wines), and grade 7 comes next with 199.
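The counts above can be reproduced with a short sketch. The data file is not loaded here, so the `quality` vector is rebuilt from the printed frequency table (an assumption made only to keep the example self-contained); in the actual analysis it would simply be `wine$quality`.

```r
# Rebuild the quality column from the frequency table printed above
# (grades 3..8 with their counts). Stand-in for wine$quality.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

quality_counts <- table(quality)              # number of wines per grade
quality_share  <- prop.table(quality_counts)  # share of wines per grade

print(quality_counts)
print(round(quality_share, 3))
print(summary(quality))
```

The shares make the imbalance easier to see than the raw counts: most wines sit in grades 5 and 6.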

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Alcohol ranges from 8.40 to 14.90, with the peak count at around 9.5. The distribution is right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Density ranges from 0.9901 to 1.0037, with the peak count at around 0.9975. The distribution is roughly normal.
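One rough way to check the skew claims without the raw data is a quartile-based skewness measure (Bowley skewness). This is not part of the original analysis; it is a sketch that uses only the rounded quartiles printed in the `summary()` output, and `bowley_skew` is a helper name introduced here.

```r
# Bowley (quartile) skewness: a positive value means the upper half of the
# data is more spread out than the lower half, i.e. a right skew.
bowley_skew <- function(q1, med, q3) {
  (q3 + q1 - 2 * med) / (q3 - q1)
}

# Quartiles copied from the summary() output above (rounded values)
alcohol_skew <- bowley_skew(q1 = 9.50,   med = 10.20,  q3 = 11.10)
density_skew <- bowley_skew(q1 = 0.9956, med = 0.9968, q3 = 0.9978)

print(c(alcohol = alcohol_skew, density = density_skew))
```

The positive value for alcohol agrees with the right skew seen in the histogram, and the value for density is closer to zero. Because the inputs are rounded, this should be read as a sanity check only.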

Univariate Analysis

What is the structure of your dataset?

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The dataset contains 1599 observations of 13 variables: a row index (X), 11 chemical properties, and quality.

What is/are the main feature(s) of interest in your dataset?

I want to investigate which chemical properties influence the quality of red wines, so quality is the main feature of interest. The chemical properties are also very important.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Most of the chemical properties are likely to affect quality, or at least to correlate weakly with it.

Did you create any new variables from existing variables in the dataset?

Not yet. In the next section I will create conditional-mean variables: the mean quality for each value of a chemical property.

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Alcohol has a right-skewed distribution.

Bivariate Plots Section

## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

Create a scatterplot of fixed.acidity against quality. Judging by the regression line, there is no strong relationship between them. Also, because quality is recorded as an integer, the graph does not show a continuous change, so it is better to compute the conditional mean of quality for each value of fixed.acidity and plot it with geom_line.

Plot the scatterplot of the mean values of quality for every specific value of fixed acidity.
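A minimal sketch of the conditional-means step, using base R `aggregate()`. The data frame below holds only the first six rows shown in the `head()` output, so the result illustrates the mechanics rather than the real curve. The column name `quality_mean` follows the name that appears in later output; the name `quality_fixed_acidity` is an assumption.

```r
# First six rows of the two columns, copied from the head() output above
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Mean quality for each distinct value of fixed.acidity
quality_fixed_acidity <- aggregate(quality ~ fixed.acidity,
                                   data = wine_head, FUN = mean)
names(quality_fixed_acidity)[2] <- "quality_mean"

print(quality_fixed_acidity)
```

On the full data, the same call with `data = wine` yields one row per distinct fixed.acidity value, which is what the line plot is drawn from.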

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

Create the scatterplot for density and quality. There is a weak negative correlation between them.

Create the scatterplot for the conditional means of quality by density, which also shows a weak negative correlation. The loess curve shows a weak negative correlation at first and then, above a density of 0.9975, a slightly positive one.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$pH and wine$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

There is a very weak negative correlation between pH and quality (r ≈ -0.06).

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

With a correlation coefficient of 0.48, there appears to be a medium positive correlation between alcohol and quality. Check the conditional means of quality by alcohol in the next step.
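The t statistic and degrees of freedom printed by `cor.test` follow directly from the correlation coefficient and the sample size. The sketch below recomputes them for alcohol and quality from the printed values; `cor_t_test` is a helper name introduced here for illustration, not a function from any package.

```r
# t statistic and two-sided p-value for a Pearson correlation r
# estimated from n observations (df = n - 2).
cor_t_test <- function(r, n) {
  df <- n - 2
  t  <- r * sqrt(df / (1 - r^2))
  p  <- 2 * pt(-abs(t), df)
  list(t = t, df = df, p.value = p)
}

# r for alcohol vs quality and the number of wines, copied from the output above
alcohol_test <- cor_t_test(r = 0.4761663, n = 1599)
print(alcohol_test)
```

With 1599 wines even a small coefficient is statistically significant, so the size of r matters more here than the p-value when judging how strong a relationship is.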

## 
##  Pearson's product-moment correlation
## 
## data:  quality_alcohol$quality_mean and quality_alcohol$alcohol
## t = 5.9481, df = 63, p-value = 1.301e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4167388 0.7359429
## sample estimates:
##       cor 
## 0.5996846

The graph shows a medium positive correlation between the two variables. However, the loess curve indicates a reversed trend above an alcohol level of 14, which is mainly caused by outliers. Remove those outliers and check the loess curve again in the next step.

After removing those outliers, the loess curve no longer shows a negative trend. We can say that alcohol has a medium positive effect on quality. However, a very high alcohol concentration, such as above 14, may lower the quality slightly.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$sulphates and wine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

There is a weak positive correlation between sulphates and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$volatile.acidity and wine$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

There is a medium negative correlation between volatile.acidity and quality. Check the conditional means of quality by volatile.acidity in the next step.

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and quality_mean
## t = -12.49, df = 141, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7943848 -0.6362877
## sample estimates:
##        cor 
## -0.7247403

The conditional means also show a negative correlation, and at -0.72 it is stronger than in the raw data (-0.39).

## 
##  Pearson's product-moment correlation
## 
## data:  wine$citric.acid and wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

There is a weak positive correlation between citric.acid and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$residual.sugar and wine$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

There is essentially no correlation between residual.sugar and quality (r ≈ 0.01, p = 0.58).

## 
##  Pearson's product-moment correlation
## 
## data:  wine$chlorides and wine$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

There is a weak negative correlation between chlorides and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$free.sulfur.dioxide and wine$quality
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606

There is a very weak negative correlation between free.sulfur.dioxide and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$total.sulfur.dioxide and wine$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

There is a weak negative correlation between total.sulfur.dioxide and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$sulphates and wine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

There is a weak positive correlation. However, according to the loess curve, the correlation turns slightly negative above a sulphates value of 0.75, perhaps because of insufficient data.

## 
##  Pearson's product-moment correlation
## 
## data:  total.sulfur.dioxide and free.sulfur.dioxide
## t = 36.341, df = 1595, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6452693 0.6989950
## sample estimates:
##      cor 
## 0.673019

There is a medium positive correlation between total sulfur dioxide and free sulfur dioxide, probably because total sulfur dioxide includes free sulfur dioxide, only in different proportions in different wines.
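The "different proportions" point can be illustrated with the six rows visible in the `head()` output. This is only a sketch on those rows; the column name `free_share` is introduced here and is not part of the dataset.

```r
# First six rows of the two sulfur dioxide columns, copied from head() above
so2 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40)
)

# Share of the total that is free sulfur dioxide (hypothetical helper column)
so2$free_share <- so2$free.sulfur.dioxide / so2$total.sulfur.dioxide

print(round(so2, 3))
```

In these rows the free amount never exceeds the total, and the share varies from wine to wine, which is consistent with a positive but imperfect correlation.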

## 
##  Pearson's product-moment correlation
## 
## data:  wine$pH and wine$density
## t = -14.53, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3842835 -0.2976642
## sample estimates:
##        cor 
## -0.3416993

There is a medium negative correlation between pH and density.

Chemical distribution analysis under different quality grades

##    Low Medium   High 
##     63   1319    217

I divide the grades into three groups: grades 3 and 4 as “low”, grades 5 and 6 as “medium”, and grades 7 and 8 as “high”. Since alcohol and volatile acidity are the two features that influence quality most, I will do the distribution analysis on these two.
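One way to build such a grouping is base R `cut()`. The sketch below rebuilds `quality` from the frequency table printed earlier (a stand-in for `wine$quality`); the variable name `quality_group` is an assumption.

```r
# Stand-in for wine$quality, rebuilt from the printed frequency table
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Right-closed intervals (2,4], (4,6], (6,8] map grades 3-4, 5-6 and 7-8
# to the three labels.
quality_group <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("Low", "Medium", "High"))

print(table(quality_group))
```

The resulting counts can be compared against the Low/Medium/High table above as a check that the breaks are set correctly.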

We can see that the higher the quality, the higher the alcohol: the center of the distribution moves to the right.

We can see that the higher the quality, the lower the volatile.acidity: the center of the distribution moves to the left.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Positive correlation with quality:

  1. fixed.acidity: cor 0.1240516, weak
  2. sulphates: cor 0.2513971, weak
  3. citric.acid: cor 0.2263725, weak
  4. residual.sugar: cor 0.01373164, weak
  5. alcohol: cor 0.4761663, medium

Negative correlation with quality:

  1. density: cor -0.1749192, weak
  2. pH: cor -0.05773139, weak
  3. volatile.acidity: cor -0.3905578, medium
  4. chlorides: cor -0.1289066, weak
  5. free.sulfur.dioxide: cor -0.05065606, weak
  6. total.sulfur.dioxide: cor -0.1851003, weak
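The same summary can be ranked by strength with a few lines of R. The coefficients are copied from the `cor.test` outputs above. The cut-offs of 0.3 and 0.7 for "weak", "medium" and "strong" are an assumption chosen to match the labels used in this report, not a rule stated in the analysis.

```r
# Correlation of each chemical property with quality, copied from the outputs above
quality_cor <- c(
  fixed.acidity        =  0.1240516,
  volatile.acidity     = -0.3905578,
  citric.acid          =  0.2263725,
  residual.sugar       =  0.01373164,
  chlorides            = -0.1289066,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.1851003,
  density              = -0.1749192,
  pH                   = -0.05773139,
  sulphates            =  0.2513971,
  alcohol              =  0.4761663
)

# Assumed thresholds: |r| <= 0.3 weak, <= 0.7 medium, otherwise strong
strength <- cut(abs(quality_cor),
                breaks = c(0, 0.3, 0.7, 1),
                labels = c("weak", "medium", "strong"),
                include.lowest = TRUE)

cor_summary <- data.frame(feature  = names(quality_cor),
                          cor      = unname(quality_cor),
                          strength = strength)

# Strongest relationships first, regardless of sign
cor_summary <- cor_summary[order(-abs(cor_summary$cor)), ]
print(cor_summary, row.names = FALSE)
```

Sorting by absolute value puts alcohol and volatile.acidity at the top, which is the ordering the rest of the report relies on.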

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  1. Free sulfur dioxide and total sulfur dioxide have a medium positive correlation, mainly because total sulfur dioxide includes free sulfur dioxide.

  2. There is a medium negative correlation between pH and density.

What was the strongest relationship you found?

So far, the strongest relationship involving quality is the one with alcohol (r ≈ 0.48). Among the other features, the relationship between free sulfur dioxide and total sulfur dioxide is the strongest (r ≈ 0.67).

Multivariate Plots Section

We can add another dimension to the graph by using color. Alcohol and volatile.acidity are the two features that affect quality most. Because quality takes discrete values, the graph is somewhat over-plotted.

Using a jitter plot improves this. It shows that higher-quality wines tend to have higher alcohol and lower volatile.acidity.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$volatile.acidity and wine$alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2488416 -0.1548020
## sample estimates:
##       cor 
## -0.202288

The plot shows that higher alcohol goes with higher quality, and higher volatile.acidity goes with lower quality. Alcohol and volatile acidity have a relatively weak negative correlation.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

Alcohol and density have a medium negative correlation, and the negative correlation between them is stronger at higher quality.
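A claim like "the correlation is stronger at higher quality" can be checked by computing the correlation separately within each quality group. The sketch below shows the mechanics only: `cor_by_group` is a helper name introduced here, and the `toy` data frame contains made-up values for illustration, not rows from the wine data.

```r
# Pearson correlation of two columns within each level of a grouping column
cor_by_group <- function(data, x, y, group) {
  sapply(split(data, data[[group]]), function(d) cor(d[[x]], d[[y]]))
}

# Made-up toy values, for illustration only (NOT taken from the wine data)
toy <- data.frame(
  quality_group = rep(c("Low", "High"), each = 4),
  alcohol       = c(9, 10, 11, 12, 9, 10, 11, 12),
  density       = c(0.998, 0.997, 0.998, 0.997, 0.998, 0.997, 0.996, 0.995)
)

group_cor <- cor_by_group(toy, "alcohol", "density", "quality_group")
print(round(group_cor, 3))
```

On the real data the call would be something like `cor_by_group(wine, "alcohol", "density", "quality_group")`, assuming a grouping column with that name has been added to `wine`.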

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

For the top two features affecting quality, higher alcohol goes with higher quality and higher volatile acidity goes with lower quality.

Were there any interesting or surprising interactions between features?

Higher alcohol goes with lower density. The higher the quality, the stronger the negative correlation between them.


Final Plots and Summary

Plot One

This graph shows the distribution of wine grades. Grades 5 and 6 have the highest counts, both above 600 in this sample. Grade 3 has the lowest count.

Plot Two

Alcohol and wine quality have a medium positive correlation, which is the highest among all the features.

Plot Three

This graph shows the relationship between alcohol and volatile acidity, and the relationship of each with quality. There is a negative correlation between the two, and higher alcohol goes with higher quality while higher volatile.acidity goes with lower quality.


Reflection

The original dataset contains 1599 observations and 11 variables describing chemical features. I am interested in which chemical features affect wine quality most. The first is alcohol: wines with a high concentration of alcohol tend to have a high quality. The second is volatile acidity, which affects quality in the opposite direction. The other features do not have a strong correlation with quality. Surprisingly, alcohol and density have a medium negative correlation, and this negative correlation is stronger at higher quality. Future work could build models on these chemical features, such as linear regression and decision trees.